Evals Summary

Dataset: evals.parquet

This example summarizes the results of 5 evaluation tasks (gpqa_diamond, aime2024, mmlu_pro, cybench, and swe_bench) across 4 models (OpenAI o4-mini and o3, and Anthropic Claude Sonnet 3.7 and 4.0).

Bar Chart

We start with a simple bar chart faceted by evaluation task:

Code
from inspect_viz import Data
from inspect_viz.plot import plot, legend
from inspect_viz.mark import bar_y

evals = Data.from_file("evals.parquet")

plot(
    bar_y( 
        evals, 
        x="model", 
        fx="task_name",
        y="score_headline_value",
        fill="model",
    ),
    legend=legend("color", location="bottom"),
    x_label=None, fx_label=None, x_ticks=[],
    y_label="score", y_domain=[0, 1.0]
)
1. Facet the x-axis (i.e. create multiple groups of bars) by task name (fx="task_name").
2. We don’t need explicit “model” or “task_name” labels, as they are obvious from context (x_label=None, fx_label=None). We also don’t need x-axis ticks (x_ticks=[]) because the fill color and legend already identify each model.
3. Ensure that the y-axis shows the full range of scores with y_domain=[0, 1.0] (by default it caps at the maximum score in the data).
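To make the faceting concrete, the sketch below groups a few rows by task and then by model. This is roughly the layout that fx="task_name" combined with x="model" produces: one panel per task, with one bar per model inside it. It uses plain Python rather than inspect_viz, the facet() helper is hypothetical, and the rows are made-up placeholders, not values from evals.parquet.

```python
from collections import defaultdict

# Hypothetical rows for illustration only; these are NOT values from evals.parquet.
rows = [
    {"model": "model-a", "task_name": "task-1", "score_headline_value": 0.50},
    {"model": "model-b", "task_name": "task-1", "score_headline_value": 0.75},
    {"model": "model-a", "task_name": "task-2", "score_headline_value": 0.25},
    {"model": "model-b", "task_name": "task-2", "score_headline_value": 0.40},
]


def facet(rows, fx, x, y):
    """Group rows into {facet value: {x value: y value}} (hypothetical helper)."""
    panels = defaultdict(dict)
    for row in rows:
        panels[row[fx]][row[x]] = row[y]
    return dict(panels)


# One entry per task (a facet panel), each holding one score per model (a bar).
panels = facet(rows, fx="task_name", x="model", y="score_headline_value")
for task, bars in panels.items():
    print(task, bars)
```

Each key of panels corresponds to one group of bars in the chart, which is why a separate x-axis label per model is redundant once the bars are colored by model.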

Confidence Interval

Here, we add a 95% confidence interval (score ± 1.96 × standard error) for each reported score by adding a rule_x() mark. Note that we compute the bounds of the interval dynamically using a sql() transform:

Code
from inspect_viz.mark import rule_x
from inspect_viz.transform import sql

def ci_value(direction):
    Z_ALPHA = 1.960
    return sql(
        "score_headline_value" +
        f"{direction}" +
        f"({Z_ALPHA} * score_headline_stderr)"
    )

plot(
    bar_y( 
        evals, x="model", fx="task_name", 
        y="score_headline_value",
        fill="model",
    ),
    rule_x(
        evals,
        x="model",
        fx="task_name",
        y1=ci_value("-"),
        y2=ci_value("+"),
        stroke="black",
        marker="tick-x",
    ),
    legend=legend("color", location="bottom"),
    x_label=None, fx_label=None, x_ticks=[],
    y_label="score", y_domain=[0, 1.0]
)
1. Dynamically compute each side of the confidence interval using a sql() transform (the ci_value() helper).
2. Draw the confidence interval using a rule_x() mark.
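The arithmetic behind ci_value() can be checked without inspect_viz. The sketch below rebuilds the SQL text that the helper above passes to sql() and evaluates the same formula in plain Python. The names ci_expression() and ci_bounds() are hypothetical, and the score and standard error are made-up numbers, not values from evals.parquet.

```python
Z_ALPHA = 1.960  # two-sided 95% critical value of the standard normal


def ci_expression(direction: str) -> str:
    """Build the same SQL text that ci_value() hands to sql()."""
    return (
        "score_headline_value"
        + f"{direction}"
        + f"({Z_ALPHA} * score_headline_stderr)"
    )


def ci_bounds(score: float, stderr: float) -> tuple[float, float]:
    """Evaluate score -/+ Z_ALPHA * stderr for a single row."""
    half_width = Z_ALPHA * stderr
    return score - half_width, score + half_width


print(ci_expression("-"))
print(ci_expression("+"))

# Hypothetical score and standard error, for illustration only.
low, high = ci_bounds(0.5, 0.05)
print(round(low, 3), round(high, 3))
```

Because this is a normal approximation, the bounds can fall below 0 or above 1 for scores near either end of the range. With y_domain=[0, 1.0], such a rule would extend past the plotted range.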

Filtering

Here we add filtering inputs that let you view a single model and/or a single task at a time. We use the hconcat() and vconcat() functions to lay out the inputs and the plot.

Code
from inspect_viz.input import select
from inspect_viz.layout import hconcat, vconcat

vconcat(
    hconcat(
        select(evals, label="Model", column="model"),
        select(evals, label="Task", column="task_name"),
    ),
    plot(
        bar_y( 
            evals, 
            x="model", 
            fx="task_name",
            y="score_headline_value",
            fill="model",
        ),
        legend=legend("color", location="bottom"),
        x_label=None, fx_label=None, x_ticks=[], 
        y_label="score", y_domain=[0, 1.0] 
    )
)
1. Use layout functions to arrange inputs and plot.
2. select() inputs and bar_y() are automatically bound to the dataset selection for evals.
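As a rough mental model, choosing a value in one of the select() inputs narrows the rows that the bar_y() mark draws. The plain-Python sketch below is only an analogy for that behavior, not how inspect_viz implements selections. The apply_selection() helper is hypothetical, and the rows are made-up placeholders rather than values from evals.parquet.

```python
# Hypothetical rows for illustration only; these are NOT values from evals.parquet.
rows = [
    {"model": "model-a", "task_name": "task-1", "score_headline_value": 0.50},
    {"model": "model-b", "task_name": "task-1", "score_headline_value": 0.75},
    {"model": "model-a", "task_name": "task-2", "score_headline_value": 0.25},
    {"model": "model-b", "task_name": "task-2", "score_headline_value": 0.40},
]


def apply_selection(rows, model=None, task_name=None):
    """Keep rows matching each chosen value; None means no filter on that column."""
    return [
        row
        for row in rows
        if (model is None or row["model"] == model)
        and (task_name is None or row["task_name"] == task_name)
    ]


# Filter on one column, on both columns, or on neither.
only_model_a = apply_selection(rows, model="model-a")
single_bar = apply_selection(rows, model="model-b", task_name="task-2")
everything = apply_selection(rows)

print(len(only_model_a), len(single_bar), len(everything))
```

The two filters combine with a logical AND, which matches the "single model and/or single task" behavior described above.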